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Multidimensional item response theory (MIRT) is widely used in assessment and 
evaluation of educational and psychological tests. It models the individual response 
patterns by specifying a functional relationship between individuals’ multiple latent traits 
and their responses to test items. One major challenge in parameter estimation in MIRT is 
that the likelihood involves intractable multidimensional integrals due to the latent 
variable structure. Various methods have been proposed that involve either direct 
numerical approximations to the integrals or Monte Carlo simulations. However, these 
methods are known to be computationally demanding in high dimensions and rely on 
sampling data points from a posterior distribution. We propose a new Gaussian 
variational expectation--maximization (GVEM) algorithm which adopts variational 
inference to approximate the intractable marginal likelihood by a computationally 
feasible lower bound. In addition, the proposed algorithm can be applied to assess the 
dimensionality of the latent traits in an exploratory analysis. Simulation studies are 
conducted to demonstrate the computational efficiency and estimation precision of the 
new GVEM algorithm compared to the popular alternative Metropolis—Hastings 
Robbins—Monro algorithm. In addition, theoretical results are presented to establish 
the consistency of the estimator from the new GVEM algorithm. 


|. Introduction 


The increasing availability of rich educational survey data and the need to assess 
competencies in education pose great challenges to existing techniques used to handle 
and analyse the data, in particular when the data are collected from heterogeneous 
populations. Different forms of multilevel, multidimensional item response theory CMIRT) 
models have been proposed in recent decades to extract meaningful information from 
complex education data. The advancement of computational and statistical techniques, 
such as the adaptive Gaussian quadrature methods, the Metropolis—Hastings Robbins—- 
Monro (MHRM) algorithm, the stochastic expectation—maximization algorithm, and the 
fully Bayesian estimation methods, also help promote the use of MIRT models. However, 
even with these state-of-the-art algorithms, computation can still be time-consuming, 
especially when the number of factors is large. The main aim of this paper is to propose a 
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Gaussian Variational Estimation for Multidimensional 


Item Response Theory 


Abstract 


Multidimensional Item Response Theory (MIRT) is widely used in assessment and 
evaluation of educational and psychological tests. It models the individual response 
patterns by specifying functional relationship between individuals’ multiple latent traits 
and their responses to test items. One major challenge in parameter estimation in 
MIRT is that the likelihood involves intractable multidimensional integrals due to la- 
tent variable structure. Various methods have been proposed that either involve direct 
numerical approximations to the integrals or Monte Carlo simulations. However, these 
methods are known to be computationally demanding in high dimensions and rely on 
sampling data points from a posterior distribution. We propose a new Gaussian Varia- 
tional EM (GVEM) algorithm which adopts a variational inference to approximate the 
intractable marginal likelihood by a computationally feasible lower bound. In addi- 
tion, the proposed algorithm can be applied to assess the dimensionality of the latent 
traits in an exploratory analysis. Simulation studies are conducted to demonstrate 
the computational efficiency and estimation precision of the new GVEM algorithm in 
comparison to the popular alternative Metropolis-Hastings Robbins-Monro (MHRM) 
algorithm. In addition, theoretical results are also presented to establish the consis- 


tency of the estimator from the new GVEM algorithm. 


Keywords: Multidimensional IRT, Variational Inference, EM algorithm 


1 Introduction 


The increasing availability of rich educational survey data and the emerging needs of assess- 
ing competencies in education pose great challenges to existing techniques used to handle 
and analyze the data, in particular when the data are collected from heterogeneous popula- 
tions. Different forms of multilevel, multidimensional item response theory (MIRT) models 
have been proposed in the past decades to extract meaningful information from complex 
education data. The advancement of computational and statistical techniques, such as the 
adaptive Gaussian quadrature methods, the Metropolis-Hastings Robbins-Monro algorithm, 
the stochastic expectation maximization algorithm, or the fully Bayesian estimation meth- 
ods, also help promote the usage of the MIRT models. However, even with these state-of-the- 
art algorithms, the computation can still be time-consuming, especially when the number 
of factors is large. The main aim of this paper is to propose a new Gaussian variational 
expectation maximization (GVEM) algorithm for high-dimensional MIRT models. 

As summarized in Reckase (2009), the MIRT models contain two or more parameters 
to describe the interaction between the latent traits and the responses to test items. In 
this paper, we focus on the logistic model with dichotomous responses. Specifically for 
the multidimensional 2-Parameter Logistic (M2PL) model, there are N individuals who 
respond to J items independently with binary response variables Y;;, for 7 = 1,...,.N and 
j =1,...,J. Then the item response function of the 7th individual to the jth item is modeled 
by 

exp(aj 0; — b;) 


PW S110) = 1 
My | 93) 1+ exp(a} 6; — b;) o 


where a, denotes a k-dimensional vector of item discrimination parameters for the jth item 


and b; specifies the corresponding difficulty level with item difficulty parameter as b;/||a;||2. 
0; denotes the K-dimensional vector of latent ability for student 2. 

For the multidimensional 3-Parameter Logistic (M3PL) model, there is an additional 
parameter c;, which denotes the guessing probability of the jth test item. The item response 
function is expressed as 


exp(a@} 0; — b;) 
1 + exp(aj 0; — bj) 


P(Y;; =1|0;) =e; +(1—c) (2) 


For both the M2PL and M3PL models, denote all model parameters as M,. Then given 
the typical local independence assumption in IRT, the marginal log-likelihood of M, given 


the responses Y is 


N N J 
I(My:¥) = Yo log POY | My) = Soe f TT PO; 16:,0,)0(8,)08,. (3) 


where Y; = (Yi;,j = 1,..., J) is the ith subject’s response vector and J is the total number 
of items in the test. The ¢ denotes the AK-dimensional Gaussian distribution of 0 with mean 
0 and covariance “ig. The maximum likelihood estimators of the model parameters are then 
obtained from maximizing the log-likelihood function. However, due to the latent variable 
structure, maximizing the log-likelihood function involves a K dimensional integrals that are 
usually intractable. Direct numerical approximation to the integrals have been proposed in 
the literature, such as the Gauss-Hermite quadrature (Bock & Aitkin, 1981) and the Laplace 
approximation (Lindstrom & Bates, 1988; Tierney & Kadane, 1986; Wolfinger & O’connell, 


1993). However, the Gauss—Hermite quadrature approximation is known to become com- 


putationally demanding in the high-dimensional setting, which happens in MIRT especially 
when the dimension of latent traits increases. The Laplace approximation, though compu- 
tationally efficient, could become less accurate when the dimension increases or when the 
likelihood function is in skewed shape. Other numerical approximation methods based on 
Monte Carlo simulations have also been developed in the literature, such as the Monte 
Carlo expectation-maximization (McCulloch, 1997), stochastic expectation-maximization 
(von Davier & Sinharay, 2010), Metropolis-Hastings Robbins-Monro algorithms (Cai, 2010b, 
2010a). These methods usually depends on sampling data points from a posterior distri- 
bution and would be computationally involving. Recently, Zhang, Chen, and Liu (2020) 
proposed to use the stochastic EM algorithm (Celeux & Diebolt, 1985) for the item factor 
analysis, where an adaptive-rejection-based Gibbs sampler is still needed for the stochastic 
E step. Moreover, Chen, Li, and Zhang (2019) studied the joint maximum likelihood esti- 
mation by treating the latent abilities as fixed effect parameters instead of random variables 
as in (3). 

In this paper, we propose a computationally efficient method that is based on the vari- 
ational approximation to the log-likelihood. Variational approximation methods are main- 
stream methodology in computer science and statistical learning, and they have been applied 
to diverse areas including speech recognition, genetic linkage analysis, and document retrieval 
(Blei & Jordan, 2004; Titterington, 2004). Recently, there is an emerging interest in devel- 
oping and applying variational methods in statistics (Blei, Kucukelbir, & McAuliffe, 2017; 
Ormerod & Wand, 2010). In particular, Gaussian variational approximation methods were 
developed for standard generalized linear mixed effects models (GLMM) with nested random 
effects (Ormerod & Wand, 2012; Hall, Ormerod, & Wand, 2011). However, the variational 
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methods have only been slowly recognized in psychometrics and educational measurement, 
with the pioneer papers by Rijmen and Jeon (2013) as well as Jeon, Rijmen, and Rabe- 
Hesketh (2017). 

In essence, variational approximations refer to a family of deterministic techniques for 
making approximate inference for parameters in complex statistical models (Ormerod & 
Wand, 2010). The key is to approximate the intractable integrals (e.g. Eq.(3)) with a 
computational feasible form, known as the variational lower bound to the original marginal 
likelihood. In psychometrics, Rijmen and Jeon (2013) first developed a variational algorithm 
for a high dimensional IRT model, but their algorithm was limited to only discrete latent 
variables. Recently, Jeon et al. (2017) proposed a variational maximization-maximization 
(VMM) algorithm for maximum likelihood estimation of GLMMs with crossed random ef- 
fects. They showed that VMM outperformed Laplace approximation with small sample 
size. However, their study is limited in several respects: (i) They only considered the Rasch 
model. Although extending their algorithm to the 2PL model may be straightforward, its 
generalization to 3PL is unknown because 3PL does not belong to the GLMM family; (ii) 
The key component in their algorithm is the mean-field approximation (Parisi, 1988) that 
assumes independence of the latent variables given observed data. Even though it seems 
acceptable to assume independence of each random item effect, this independence assump- 
tion can no longer apply to the MIRT models when different dimensions are assumed to be 
correlated; (iii) In their first maximization step, the closed-form solution still contains a two- 
dimensional integration where adaptive quadrature is used; in the second maximization step, 
a Newton-Raphson algorithm is used. Therefore, both steps involve iterations, which may 


slow down the algorithm. Instead, our proposed GVEM algorithm has closed-form solutions 


for all parameters in both the E and M steps, and it can deal with high-dimensional MIRT 
models when the multiple latent traits are correlated. Moreover, the GVEM algorithm is 
established for both the M2PL and M3PL models. Consistency theory of the estimators from 
our proposed algorithm is established, and the performance of the algorithm is thoroughly 
evaluated via simulation studies. 

The rest of the paper is organized as follows. Section 2 introduces the general frame- 
work of the Gaussian Variational method and derivation of EM algorithm in MIRT models. 
Section 3 presents the GVEM algorithm for M2PL with the use of local variational approxi- 
mation and presents the theoretical properties of the proposed algorithm. Section 4 extends 
the GVEM algorithm to M3PL and also presents the stochastically optimized algorithm to 
further improve its computational efficiency. Section 5 and section 6 illustrate the perfor- 
mance of the proposed GVEM method with simulation studies and on real data, respectively. 
The paper is concluded with Section 7, which discusses any future steps. The Supplementary 
Material includes the detailed mathematical derivations of the EM steps and the proofs of 


the theorem and proposition. 


2 Gaussian Variational EM (GVEM) 


From here onwards, for the MIRT models in (1) and (2), we denote the model parameters 
by A S406. 9 Hd i S108 = Lt end. C Se. = dco. As detned 
in Section 1, we use the notation M, = {A,B,C} in the 3PL model and M, = {A,B} 
in the 2PL model for simplicity. Latent traits @ from different dimensions are correlated, 


resulting in a K by Kk covariance matrix Ng. To fix the origin and units of measurement, it 


is conventional to fix the mean and variance of all 8’s to be 0 or 1, respectively. To remove 
rotational indeterminacy in the exploratory analysis, (i.e. to ensure the model identifiability) 
researchers often assume ig = Ix and A contains a k-dimensional triangular matrix of zeros 
(Reckase, 2009). On the other hand, in the confirmatory analysis, the zero structure of the 
loading matrix A is completely or partially specified while the remaining nonzero elements 
are left unknown. In this case, the correlation of latent traits @ is of interest and we need to 
estimate the covariance matrix Mg. In this paper, we consider a general setting of Mg that 
works for both exploratory and confirmatory analyses. 

The idea of variational approximation is to approximate the intractable marginal likeli- 
hood function, which involves integration over the latent random variables, by a computa- 
tionally feasible lower bound. We follow the approach of variational inference (Bishop, 2006) 
to derive this lower bound. 


The marginal log-likelihood of responses Y is 
N N J 
i(M,; Y) = — log P(Y; | M,) = S— log / [][ Pi | 4:,.,)6(6,)46;, 
i=1 i=1 j=l 


where @ denotes a k-dimensional Gaussian distribution of 8 with mean 0 and covariance 


Xe. Note that the log-likelihood function 1(M,; Y) can be equivalently rewritten as 


N 
I(M,; Y) = Be) log P(Y; | My) x q;(0;)d0;, 
i=1 ° % 


for any arbitrary probability density function q; satisfying Se. qi(9;)d@; = 1. Since P(Y; | 


M,) = P(Yi,9; | M,p)/P(0:; | Yi, Mp), then we can further write 


I(Mp; Y) = Ss" 


x Gi(8;)d0; + KL{q;(8;)||P(8: | Yi, Mp) 


where K L{q;(0;)||P(0; | Yi, M,p)} = le. log re IT x q;(O;)dO; is the Kullback-Leibler (KL) 
distance between the distributions q;(0;) and P(6; | Y;, M,). The KL distance kK L{q;(0;)|| P( 
Y;, M,)} > 0 with the equality holds if and only if ¢;(0;) = P(@; | Yi, M,). Therefore, we 


have a lower bound of the marginal likelihood as 


I(M,; Y) 


IV 


S [ve ae ea Mp) x qi(O;)d8; (4) 


N 
= 3 | log P(Yi, 0; | M,) x qi(8:)d0; — S— | log qi(@;) x qi(0:)d0; 
i=1 7 9: j—1 2 9: 


and the equality holds when q;(0;) = P(@; | Yi, M,) for i=1,...,N. 

The follow-up question is how to design the candidate distribution function q;(@;) that 
gives the best approximation of the marginal likelihood. From the above argument, the best 
choice is the unknown posterior distribution function P(@; | Y;,.M,). Although this choice of 
qi(9;) is intractable, it provides a guideline to choose q;(@;) in the sense that a good choice 
of q;(8@;) must approximate P(@; | Y;,.M,) well. The well-known EM algorithm follows this 
idea and can be interpreted as a maximization-maximization (MM) algorithm (Hunter & 


Lange, 2004) based on the above decomposition. In particular, the E-step chooses q; to be 


6; | 


a distribution that minimizes the KL distance function, which corresponds to the estimated 
posterior distribution P(6; | Yi, Mp) with M, from the previous step estimates. The E-step 


then evaluates the expectation with respect to q;’s, i.e., 


N 
Sf toe. PO 8: |My) x a8, (6) 
i=1 4% 


which is equal to the lower bound in (4), except the additional constant term — 57”, Jo, 108 qi(i) x 
q;(0;)d0; that does not depend on model parameters M,. In the M-step, we maximize the 
above expectation term to estimate model parameters and this is equivalent to maximizing 
the lower bound in (4). 

However, one challenge in the EM algorithm is to evaluate the expectation in (5) with 
respect to the posterior distribution of 8;. In the MIRT model, it is known that this integral 
in (5) does not have an explicit form and in the literature, numerical approximation meth- 
ods are often used, such as the Gauss—Hermite approximation, Monte Carlo expectation- 
maximization (McCulloch, 1997), and stochastic expectation-maximization (von Davier & 
Sinharay, 2010). 

To avoid directly evaluating the posterior distribution of 0;, the variational inference 
method uses alternative choices of the q;(0;)’s to approximate the marginal likelihood func- 
tion. The choices of g;(@;) not only approximate the posterior P(6; | Y;, /,) well, but also 
are easy to compute and usually give closed form evaluations in the algorithm. In particular, 
from the MIRT literature, we know that as the number of items J becomes reasonably large, 
the posterior distribution P(@; | Y;,/,) can be well approximated by a Gaussian distri- 


bution (Bishop, 2006). Motivated by this observation, we use the Gaussian approximation 


procedure that chooses q;(@;) from a family of Gaussian distributions such that the KL dis- 
tance between q;(0;) and P(@; | Y;,M,) is minimized. The estimation is then taken as a 
two-step iterative procedure. In the variational E-step, we choose q;(@;) by minimizing the 
KL distance between q;(0;) and P(0; | Y;, M,) and evaluate the expectation of the likelihood 
function with respect qg;(0;), which is (5). In the M-step we update the unknown model 
parameters by maximizing the above expectation. The algorithm repeats the two steps until 
convergence. In the following sections, we present the detailed algorithm steps for the M2PL 


and M3PL models. 


3 GVEM for the M2PL Model 


In this section we present the GVEM algorithm for the M2PL model. Without loss of 
generality, we first focus on the 7th subject’s likelihood function due to the independence of 


different subjects’ responses. The joint distribution function of 0; and Y; is 


log P(Y;, 0; | A, B) = log P(Y; | 6;, A, B) + log 9(4;) 


J a5 
exp(a@; 6; — b;) 1 
Yij ? play + log 6(8; 
> gas exp(a:} 0; — b;) ( i) log 1+ exp(a} 6; — b;) 08 (91) 


j=l 
a 1 
- Yi;(a!0; —6;) +1 1] 0;). 


The difficulty of handling the marginal distribution of Y; mostly comes from the logistic 
sigmoid function, which makes the integration over @ not in a closed form in the E-step (i.e., 
Eq. (5)). 


To avoid dealing with intractable likelihood in E-step, we use a local variational method 


10 


initially proposed in the machine learning literature (Bishop, 2006; Jordan, Ghahramani, 
Jaakkola, & Saul, 1999), which finds bounds on functions over individual variables or groups 
of variables within a model instead of the full posterior distribution over all random variables. 
For notational simplicity, hereafter, we denote x;,; = b; — a} 0;. Because of the concavity 
of the logistic sigmoid function log(1/(1 + e~*)), by the local variational method, we have 


the following variational lower bound on the logistic sigmoid function, 


gay ene (2ij — &3) 2 2 
tens) = a 4 ea) exp {Busse mao Cr ee ay} 
ern (21,5 — &,5) 2 
+ ef) exp [eye — (&i,5)(%j5 — ays ; (6) 


where &;,; is a variational parameter that is introduced to approximate the objective function 
es /(1 +e), and n(&i3) = (2&;) 71 e&s/(1 + e&4) — 1/2]. We then aim to estimate the 
parameter €;,; that achieves the equality of the above display. By introducing an additional 
variational parameter €;;, we successfully avoid the problem of estimating the intractable 
integral in the E-step. The values of €;,;’s will be iteratively updated in the M-step. 

Using the lower bound on the logistic sigmoid function, we obtain a closed-form lower 


bound for log P(Y;, 8; | A, B) as follows 


log P(Vi,0:| A.B) > Yoga 


J 
~ So nlEig){(bj — aj 91)” — & 5} + log 4(0;) 
j=l 


= L(Y;, 6:, &; | A,B) 
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wher €2= (6.49 Stok) 
The key step is to find the optimal variational distribution q;(@;), which we describe in 


detail in the next section. 


3.1 Algorithm Details 


Choice of g; Conditional on the model parameters A, B and the variational parameters 
&,; fori = 1,...,N,j = 1,...,J, by the variational inference theory, it can be shown that 
the variational distributions q;(0;),i = 1,...,N that minimize the KL divergence with the 
posterior distributions P(@;|A, B),i =1,..., N take the following form: 
J J Ty-l 
lo a:(8:) oY (Ys — 5) FO: — Yo nlEis)(Bj — 7)? — 
j=l 


2 


j=l 


The standard nonlinear optimization technique is exploited to show that q;(@;) ~ N(0; | 
[4;, 4;) minimizes the KL divergence among all normal distributions where the mean and the 


covariance are 


J 
1 
ire S- {2n( fs )0p se Ya shay, (7) 
j=l 
¥) 
uy = Dg +2>_ nlG,slajay. (8) 
j=l 


With the variational densities q;(@;)’s, we aim to estimate model parameters &,’s, a;’s 
and b,’s by maximizing the lower bound of the marginal likelihood. Suppose we have €;,’s 
from a previous step’s estimation or the initial values, denoted by e. Similarly, define 


AY = {al j ae emery oF BO= {oj ae ey ge nO) ; and by (The EM iteration 


us 
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is presented below. 


E-Step In E-step, we evaluate the closed-form lower bound of the expected log likeli- 
hood with respect to the variational distributions q;’s. With iteratively updated varia- 


tional parameters ust ) and 5”, we easily evaluate the tth iteration’s lower bound of the 


4? 


expected log-likelihood. Denote the tth iteration’s variational density as q!?(0;) = q(0; | 


v 


6, AY BY, mM), Then, the tth iteration’s lower bound can be derived as 


BABE) = Df 10%,0.6,| AB) x al(0,)00, 
6 


N a I 1 1 
2 (t) (T(t) (t) 
= (tose + G Ya)? + (Hy - Pal Wl? — Fel 
i=1 j=l (1 + e%7) 
— fel) {8 — 28Paf 70 + al” + (ul?) "Jas? —€} 
N 
LN 1 
5 los le So sPre TEP + HH)". 
i=. 


M-Step In M-step, we maximize the estimated lower bound to update the model parame- 
ters (A, B,€, 9). This is achieved by simply setting the derivative of the lower bound with 
respect to (A, B,€, 9) to be zero. As a result, it can be shown that each update of the 
model parameters are done in a closed form, which makes the proposed GVEM algorithm 


computationally efficient. The updating step is presented below. The most recently updated 
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copies of the parameters are used for each iterative update. 


Lr ei 1 
aj = 5|donG)Ei+mG smn] So [(Ye-5+22nlGa)e], 
i=1 1=1 
8 Sea [GY +20. egra a 
ra 2n(Eis) 
iy = DF — Qbjagj i + gj [Ei + map! lors. (i) 


For the covariance matrix Ng, in the exploratory analysis, we can keep Ng = Ix during the 
GVEM estimation and then later performed proper rotation; in the confirmatory analysis, 


we update Ng by 


Ae 


N 
pe [Es + pipe; | (12) 


Note that if the Mg is assumed to be the correlation matrix with diagonals being 1, then 
we need to standardize the estimated Mg to get correlation matrix. Detailed derivations 
regarding the above EM steps are given in the Supplementary Material. 

In light of the above exposition, the GVEM algorithm for M2PL can be summarized as 


follows. 


Algorithm 1 GV-EM algorithm 
1: Initialize M® = {Ao, Bo}, €0 


2: repeat 
3: Estep: For step t > 1, update ult ) and a according to closed-form equations (7) 
and (8). 


4: M step: Further update Mo? and € according to closed-form equations (9), (10), 
and (11), iteratively. Fix 50 — = I in the exploratory analysis or update of according 
to (12) in the confirmatory analysis. 

5: until convergence 
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Remark 1 The algorithm complexity increases with the sample size N, which makes the 
algorithm computationally inefficient for large data sets. Thus, we can stochastically optimize 
the EM algorithm by sub-sampling the data to form noisy estimates of the variational lower 
bound and model parameters. Please refer to Section 4.2 for detailed explanation of the 


stochastic GVEM. 


Remark 2 Under the IRT framework, test dimensionality is one of the major issues explored 
in order to validate the design of a test and help practitioners with test development. As a 
byproduct of the algorithm, we can empirically estimate the number of latent dimensions from 
data. Specifically, the information criteria such as AIC or BIC can be used to compare the 
model fit with varying number of dimensions. Because we approximate the true log-likelihood 
by its lower bound in GVEM, the information criteria also need to be modified by replacing 
the true log-likelihood with the variational lower bound, resulting in the following modified 
AIC and BIC, denoted as AIC* and BIC*. The approximated information criteria are 
as follows, AIC* = 2(||Allo + ||Bllo + ||Xollo) — 2E(A, B,€) and BIC* = In(N)(||Allo + 
|Bllo + |[Zollo) — 2E(A, B,€) where E(A, B,€) is the estimated variational lower bound 
and A, B,€ are the final estimates from GVEM estimation procedure. The notation || Allo 
of matrix A denotes the zero norm of the matrix A, which is simply the number of non- 
zero entries of A. The advantage of using GVEM to estimate test dimensionality is that it 
is computationally more efficient especially under high dimensional data and more complex 
model. This procedure can be easily applied in both the 2PL and the 3PL models. Please see 


the simulation study for more discussions. 
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3.2 Theoretical Properties 


In this section, we establish theoretical bounds on the estimation of the model parameters 
under the high-dimensional setting where both N and J go to infinity. The dimension of 
latent traits, K’, is assumed known for this analysis and thus fixed. As defined in Section 2, 
A = [aj] 7x« denotes a matrix of factor loadings. Additionally, let © = [6;;]~« denote a 
matrix of random variables following q;(0;) and let 6 = [8:,] Nxk denote a matrix of estimated 


latent abilities from data. Define FE, 


@~q to be the expectation with respect to the estimated 


variational densities {4(0;) ~ N(fi:,¥;) : i = 1,...,N} from data. Lastly, a superscript 
* denote a true parameter. For example, 0; denotes the i” person’s true latent ability, 
which is a deterministic realization from its population distribution. We assume that the 
true parameters O* and A” satisfy 

(A1). ||67|? < C and ||a%||? < C for all i,7 for some positive constant C 


ea 
Theorem 1 derives the bound on the expected Frobenius norm of the error, ||OA — 


0*(A*)'||p, where ||M||p- = , [7 ;,; Mj denotes the Frobenius norm of a matrix M. 


Theorem 1 Suppose that condition (A1) is satisfied for the true parameters ©O* and A”. 
With optimally estimated variational densities q; from data and estimated parameter matrix 
A that maximizes the variational lower bound, there exists absolute constants C, and Cy 


such that 


1 


J+N [ log(N+ J) 
NJ 


JIN hey. 


aT 
E.qll[OA — 9*(A")' lr] < cre") 
is satisfied with probability 1—C\/(N + J). 
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The proof of Theorem 1 can be found in the Supplementary Material. 


Remark 3 Theorem 1 states that the expected estimation error measured by Frobenius norm 
goes to 0 as both N + co and J + co. The proof of Theorem 1 follows a similar argument 
from Davenport, Plan, Van Den Berg, and Wootters (2014) and Theorem 1 in Chen et al. 
(2019). However, the previous work by Chen et al. (2019) treats 0; as fixed effects while this 
work follows the conventional MIRT model setting with 0; random effects and following a 


normal population distribution. 


Remark 4 The Gaussian family as the candidate choice of q is reasonable according to 
Laplace approximation of the posterior distribution P(@;|Y;). The Laplace approximation 
of P(0@;|Y;) is a normal distribution with MLE 0; as mean and inverse of observed Fisher 
information I-1(0;) as variance. Denote 0; as the true parameter. By Bernstein-von Mises 
Theorem, since P(Y; | @;),4 = 1,...,N have same support and 0; — log P(Y; | 0;) is 
twice continuously differentiable, then 6, 6; almost surely and the Laplace approximated 
distribution N(0;,I~1(0;)) converges in distribution to the true limiting normal distribution 
N(0;,1~'(07)) as J — oo where I~'(6;) is the inverse of expected Fisher information. This 
supports our choice of variational density q; as a multivariate Gaussian distribution provides 


an asymptotically good approximation for the true posterior distribution of @. 


Remark 5 Compared with the existing stochastic estimation algorithms, such as the Metropolis- 
Hastings Robbins-Monro algorithm and the stochastic EM algorithm, the proposed estimation 
method has the advantage that each of the estimation iterations has simple closed-form update 
and it does not involve the stochastic samplings from some intermediate posterior distribu- 
tions as in the current stochastic estimation algorithms. As discussed in Remark 4, even 
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though variational distributions are used to approximate the posterior distributions in our 
method, the normal approximation is asymptotically valid. Simulation studies in Section 5 
further illustrate this. Moreover, the above variational EM development can be easily gener- 
alized to the M3PL model and can also be naturally combined with the idea of the stochastic 


EM, as illustrated in the next section. 


4 GVEM for the M3PL Model 


Derivation of the variational lower bound is trickier in the M3PL function since the cancella- 
tion of log and exponential function, which was essential in simplifying the variational lower 
bound in M2PL, is impossible due to the addition of a guessing parameter. To solve this 
problem, we introduce another latent variable, Z;; which is an indicator function of whether 
ith individual answered jth item based on their latent abilities or guessed it correctly (von 
Davier, 2009). We define Z;; = 1 if ith individual solved item j based on his or her latent 
ability, and Z;; = 0 if he or she guessed item j correctly. Notice here that for the case of 
Zi; = 1, Yi; can be either 0 or 1. However, when Z;; = 0, Y;; has to be 1 by the definition 


of Z,;. Hence, {Y;; = 0, Z,; = 0} cannot occur. 


Proposition 1 Given the two latent variables 0; and Z;;, then P(Y;; | 8;) under the follow- 
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ing hierarchical model is equivalent to (2) of the 3PL model. 


Zi; ~ Bernoulli(1 — c;), 


exp(aj 0; — b;) 
1 + exp(aj 6; — b;) 


Yi | 0,25 = 18 Bernoulli( | ), 


Yi; | 0:, Zi; = 0 ~ Bernoulli(I(Y;; = 1)). 


The distribution of observation Y;; given latent variables 0; and Z;; is then 


exp(a@j 0; — b;) 1 | 1-Yi; 


Vij 
P(Yig| Zig, 81) = { Fi + exp(a} 6; — mi Fi + exp(a} 6; — b;) 


Without loss of generality we first focus on the 7th subject’s likelihood function due to the 
independence of different subjects. Denote Z; = {Zi1, Zi2,..., Zs} and its distribution as 


(24) = TIj-1 p(Zis). Then the complete data likelihood of the ith subject is 


log P(Y;, 9;, Z; | A, B,C) 


= log P(Y; | 0;,2;, A,B,C) + log 6(0;) + log p(Z;) 


J T 
exp(a; 8; — b;) 1 
= ere j” (POV eal 
) gee fee ( )4ij 108 ees 


J 
+S 0 {(1 — Zj) log 1(¥ij = 1)} + log (6;) + log p(Zi) 
j=l 
Z 1 
= Y;;Z;;(a! 0; — b;) + Z;;1 EU =Z7./\logi(¥e= 1 
»{ cm) ig(Q; bj) + J °8 T+ exp(ay 6; — bj) ( 5) 0g ( J } 


+ log ¢(8;) + log p(Z;). 


Following the result from Proposition 1, the hierarchical formulation of the 3PL model 
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with the new latent variable Z;; could be used to derive the GVEM algorithm for the 3PL 
model. Please refer to the Supplementary Material for the proof of Proposition 1. Similar 
data augmentation scheme was proposed in Albert (1992) in the Bayesian framework. 

In this section, we derive the optimal choices of the variational densities for the latent 
variables Z;; and @;. The approach is similar to that of the 2PL model. For any arbitrary 
density functions gq; and r;,; of the latent variables 0; and Z;;, the following equation always 
holds 


log P(Y; | A, B,C) = | Slog P(Y; | A, B,C) x qi(Oi)ri(Zi)d6;. 
6; 


where r;(Z;) = bey rig (Zig) 


Note that P(Y; | A,B,C) = P(Y;,6;,Z;| A,B,C)/P(0;,Z; | Y;, A,B,C). We can 


write 


PY 5:03, 2; | A,B,C) 
Tag B,C) 


log P(Y;| A,B,C) = ne x qi(0;)ri(Zi)d0; 


ae ri(Z a) ||P(@i; 2. | Yi, A, B, C)}. 


Since the KL distance is > 0 by definition, we get a lower bound on the marginal likelihood 


similarly as in the 2PL model. 


log P(Y,; | A,B,C) > | Siow Pu, 6:, 2: | A,B,C) x a(0,)rs( Zi), (13) 


0: “Z, 


= : Log (ai(0,)ri(Zi)) x ail) Zi), (14) 
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Since (14) doesn’t depend on parameters A, B and C, we focus on (13) for the derivation 


of the lower bound. Again, the ith subject’s likelihood function is 


log P(Y;, 0;, Z; | A, B,C) 


J 
1 
= S Y,,Z;j(ae} 0; — b;) + Zyl + (1— Z,;\ log I(¥;; =1 
jFij(Q; i) + Zij °8 T+ expla]; — bj) ( j) log L(Y; 1 


j=l 
+ log 6(6;) + log p(Z;). 


Using the same variational lower bound (6) on the logistic sigmoid function as in the 2PL 


model, we show 


log P(Y;, 9;, Z; | A,B, C) 


IV 


J esis é - 1 T 
S- Lij log (1 + ea) + ys Lig Vig Q; 0; Fe b;) + 5 Zi (05 = a; 0; — 4) 
j=l j=l 


g=1 


J J 
= » Zisn(Ei,g {(bj — 067 81)” — Rj} + 2 {(1 — Z;;) log 1(Yij = 1)} 


+ log 6(0;) + log p(Z;) 


= U(Y;, i, Zi, &; | A,B, C). 


Recall that if Y;; = 0, then we always have Z;; = 1 by the design of our model. In other 
words, {Yj;, Zi;} = {0,0} cannot occur. To accommodate this constraint, we replace Z;; by 
Zi; =1=Y)j+ 277, so that Zi; =f, Y= 1 and Zi; = 1if Y,; = 0. This makes sure that 


the case of {Yj;,Zi;} = {0,0} is not included as a possible scenario during the estimation 
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procedure. By this substitution, we have 


J ea J 
e Jd 
= 5 - Vy t+ ZV) log G@ecen, (1—Yig + ZigVig)Yig (aaj i — 0) 
j=l j=l 
C1 
+) 5 (1 = Vig Zig Vig (bj Sa; 0; = & 3) 
j=l 
J 
— Si (1 = Vij + ZisVis)(Gug){ (bj — oj 81) — G5} 
j=l 
J J 
+ $7 {¥ig(1 — Zij) log (Viz = 1} + log 4(8;) + Slog p(Zi;) 
j=l j=l 


where log p(Zj;) = (1 — Yij + Zij¥iz) log(1 — ej) + Yij(1 — Zig) log (cj). 
With variational distributions q;’s and r;’s, we have the following expression for the 
variational lower bound of the marginal likelihood, which is an expectation of the joint 


distribution with respect to q;’s and 1;’s, i.e., 
N 
E(A, B,C, €) := ae So 101.4 Zi,€,| A,B,C) x r°(Z;)| x ¢!?(0,)d0;. (15) 
i=1 9% LZ, 


Appropriate choices of the variational distributions will lead to a closed form expression 
of the lower bound expressed in (15). As in the 2PL model, we choose the variational 
distributions for each latent variable by finding a distribution that best approximates the 


posterior distribution of each latent variable. 
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4.1 Algorithm Details 


Choice of g; Let E, denote the expectation with respect to. the variational densities of 


Zij'S, ie. 74;(Zi;)’s. We can write 


N 
E,(A, B,C, &) : =S0S 0%, 6, 2, & | A, B,C) x rij(Ziy) 
i=1 Zi; 
N J a J 
esis T 
= | UG Vy + BZ) ¥a) los Gey + DUG — Yu + BelZil Ys) ¥is(ers 8i — by) 
i=1 | j=1 j=1 
7 1 
+S °(1-¥j+ B,|Zis]Vij) 5 (0) — a} 0; — &;) 
j=l 
J 
— S01 — Yj + Br[Zigl¥is)n(Eig){(0; — 007 81)? — 2} 


Conditional on the model parameters A, B,C and the variational parameters €; where 7 = 
1,...,.N, by the variational inference theory, we can show that the variational distributions 
qi(9;),7 = 1,..., N that minimize the distances between them and the posterior distributions 
take the following form; 


J 
1 
log qi(@:) « wel — Viz + Ep(Zij) Vig) (Yi = 5)0} 8: 
=1 


&. 


yi 
1 
— S71 = Vij + Bol Zi¥ig)) (Gig) (Oj — 7 83)? — 507 Ep8,. 


j=1 
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The above likelihood function implies that q;(@;) ~ N(@; | i, 4%) where the mean and 


covariance are 


J 
1 
= ox S- {2n( 6 bev 5 5} = Vig E,(Zig)¥ig) Oz; (16) 
j=l 
J 
ae yee we — Viz + E,(Zij)Vig)n (Eig jar}. (17) 
j=l 


Choice of r;; We follow the similar steps as q. That is, we take the expectation of the 
lower bound I(Y;,0;,Z;,§; | A,B,C) with respect to the variational density of 6;, q;(0;) 
and derive the variational distributions for Z;;,2 = 1,...,N,j = 1,...,J. The variational 
distribution minimizes the distances between them and the posterior distributions of Z;, 
given model parameters A, B,C and the variational parameters &,. 

Let E, denote the expectation with respect to. the variational densities qg;’s and Ey, denote 
the expectation with respect to q;. Taking expectation of the lower bound I(Y;,0;, Z;,€; | 


A, B,C) with respect to q(0;), we have 


E,(A, B,C, &) 


N J J 
= S00 -¥5 + 24%) ow PO + = + Z9¥ (OT EQ —b) 
i=l j=l 


j=l 
J 
1 

a at a Zigig) 5 (bj — Oj Ea,|8i] — &s) 

j=l 

J 
— > ((1 — Vig + Zig¥is)n(Gig) {Bel (Oy — 0¢f 04)"] — Gy} 

es 
+E (1 — Z;;) log (Yi; = 1)} + E,,[log ¢(@ 1+ DownZ a) (18) 


24 


This implies that the variational distributions r;;(Z,,;) are 


esis 
(1 + e&) 


~nlGis Eal(6; ~ 07 8.7 ~ €2)} + og( ~ 6) 


1 
log rig(Zig) «x Lig Vig oe + ¥i;(aj Eq,[0i] — 63) + 5 (bi — aj Ey,[03] — &3) 


+¥4(1 = Z,)[ 10g 105 = 1) + lo(o) 


Thus, 7;;(Zi;) ~ Bernoulli(s;;) where s;; = 1 if Y;; =0 and 


1 ee een efi. + 
8; = 14 (2 166 exp { — Yi;(a,; Eq,|0i] — 63) + 
1 
5 (i — a} Ey,[0i) — &:3) — n(6:,3) {Eq [(bj — aj 4;)7] - 2th (19) 


With the chosen q;’s and r;;’s, we aim to estimate model parameters €, A, B and C, by 
maximizing the variational lower bound of the marginal likelihood, i.e., (15). The EM steps 


for 3PL model follow the same procedure as in 2PL case. 


E-Step In every E step, we choose the optimal variational distributions q;’s and r;;’s, which 
is equivalent to estimating variational parameters j1;, U;, and s;; for every 7 and j. With 
( 


iteratively updated variational parameters, (i.e. ul, 5, and ee) and most recent updates 


of model parameters (i.e. Me = {A®, B®, C}), we derive a closed form expression of 
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variational lower bound at tth step as follows; 


2 T 2 
= 589 — (EDO? — PAP" uP + (AP TEP + UP)WPY aP - 8 1) 


M-Step In this step, we again maximize the E (A,B,C, €) to update the parameters 
(A, B,C,€). This is achieved by setting the derivative of EO(A, B,C, €) with respect to 
(A, B,C, €) to be zero. Since we have a closed-form expression of the lower bound, updates 
of the model parameters are also in closed-form. Detailed derivation is provided in the 
Supplementary Material. 

For € and No, the update is the same as in 2PL model. For other parameters, we derive 


the updating rule by taking derivative of the variational lower bound F(A, B,C, €) derived 
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in E step. As a result, we have the following updating rule for a;, b; and c;; 


N 
1 = 
OG = 5 bye — Vig + 8ig Vig )m(i5) i + ptt] 
i=1 
. 1 
x ye (1 — Yiz + 8i3Yiy) (Ys ~ 5 Ey 2bjn(Gig)) ar (20) 
i=1 
N zi 
yeah Yu + su¥an |G = Ya) + 2nl6deg” "| - 
po y 
doer 2(1 — Vay + 8s ¥ig)(Eeg) 
N 
Sa 
Cp. Say RE =H >— ¥is(1 — 54). (22) 
doim(l — Vig + 8igVig) + Doin Vag — Sig Yay) A 
The Algorithm 2 summarizes the EM steps for GVEM algorithm in M3PL. 
Algorithm 2 GV-EM algorithm for M3PL 

1: Initialize My, = { Ap, Bo, Co}, €©. 
2: repeat 
3: E step: For step t > 1, update variational parameters utd ser nd sft? 


4 5) 


according to closed-form equations (16), (17), and (19). 


7] 


4: M step: Further update Mery according to closed-form equations (20), (21), and 


(22) iteratively. Update €“*) and ie same as in M2PL. 
5: until convergence 


Remark 6 The theoretical property of the M3PL is more challenging to derive rigorously 


due to the addition of the guessing parameters c;’s. From Theorem 2 in Davenport et al. 


(2014) we can show that the Hellinger distance of error between estimated probability dis- 


tributions and the true probability distributions 1s bounded above. For this discussion, we 


define Hellinger distance for probability distributions and matrices. Hellinger distance for 


two scalars p,q € [0,1] is defined as dz,(p,q) := (/p — J? + (V1 —p— V1—@)?. Fol- 


lowing Davenport et al. (2014), we also allow the Hellinger distance to act on matrices by 


averaging Hellinger distances over their entries. For matrices P,Q € [0,1]@*®, we de- 


2f 


fine B(P;Q) = Ee d?,(P.;,Qij). Let M = [Mij|nxs be the matrix with entries Mj; 


exp(M;;) oe (1—c;) exp(a} 0;—b;) 
+exp(Mj;) ~— 9 | J 1+exp(a; ;—b;) 


satisfying + . Let P(Y|M) be a matrix of probability dis- 
tributions P(Y;;|Mi;)’s where M;; denotes a collection of model parameters ay; ,b;,c;. Again, 


M* denotes a matrix of true parameters and M denotes estimated model parameters. Then 


by Theorem 2 of Davenport et al. (2014) 


cee a (N + J)log(N J) 


(POW), PCM") < Cxc/ NA DP 


Cy 


ae for absolute constants C, and Cy. Hence, the Hellinger distance 


with probability 1 — 


between estimated probability distribution and true probability distribution goes to 0 as both 
N + c and J > oo. However, the consistency result for model parameter {a;,b;,¢; + Jj = 


1,...,J} in M8PL is more challenging to derive and thus left for the future research. 


4.2 Stochastic Optimization of GVEM 


In M3PL, the proposed GVEM algorithm may become computationally inefficient as sample 
size increases because of the additional variational parameters and model parameters to esti- 
mate compared to M2PL. Especially in the E step, variational parameters (i.e. [4;, Uj, &,;, $i;) 
need to be optimized for every data points 7 = 1,...,N. Thus, the computational bur- 
den increases with larger sample size N. To improve the computational efficiency of the 
GVEM algorithm, we can stochastically optimize the variational approximation in the E 
step (Hoffman, Blei, Wang, & Paisley, 2013). That is, at each iteration of the E step, we 
subsample the data to form noisy estimate of the variational lower bound and iteratively up- 


date the estimate with a decreasing step size. Then M step in Algorithm 2 follows using this 
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stochastically estimated variational lower bound. The stochastic optimization only affects 
the E step, thus with minor changes to the original GVEM algorithm we can stochastically 
optimize the algorithm for M3PL. The noisy estimates of the variational lower bound are 
cheaper to compute as it only requires small subset of the data at each iteration. Also, for 
complicated models like M3PL, following such noisy estimates can also help the algorithm 
to escape local optima of complex objective functions. Specifically, the stochastic EM steps 


can be summarized as follows. 


Stochastic E step For step t > 1, choose a subset of data S; with desired size. Choose a 


) yO 


decreasing step size €,. Update u;", &;", e0 and si) for data point 2 € S; only, according to 
closed-form equations (16) and (17). Since we only update variational parameters for 7 € S;, 
the algorithm is computationally more efficient than GVEM approach without stochastic 
optimization, especially when the size of the subset 5; is chosen to be small. 


With updated variational parameters partially for 7 € S;, calculate noisy estimate of tth 


iteration’s expected variational lower bound Or as follows; 


Q: = a S10, 6;, Lik; | A, B, C) x r(Zi) x gf” (0;)d0; 
6; 


Zi 


Then we obtain a stochastic approximation of the variational lower bound by a weighted 
average of previous and current step’s noisy estimates of the lower bound, i.e. (1 — 6) QOi-1 + 


EQ. 
M step Once E step is done, we follow the previous M step. That is, estimate A” B”. 


C w and oe that maximizes the stochastic approximation of the variational lower bound. 
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Notice that this stochastic optimization idea is different from the stochastic component 
in the stochastic EM (StEM) algorithm (Nielsen, 2000). In the StEM algorithm, random 
samples of the unobserved latent variables 8; are drawn from the conditional distribution 
of 8; given observed variable Y;, and these random samples are used to approximate the 
otherwise intractable expectation in the E step. In our algorithm, the stochastic component 
instead refers to the random sub-sampling of the observed data {Yj;,7 = 1,...,.N} to form 
a noisy approximation of the variational lower bound E(A, B,C, €) in E step. 


In theory, if a sequence of step sizes satisfies the conditions such that 


S\6& = 00 and S\  < 00, (23) 


which results in a sequence of decreasing step sizes, the algorithms provably converge to an 
optimum (Robbins & Monro, 1951). Following the approach in Hoffman et al. (2013), we set 
the tth step size as e, = ({+7)~" where forget rate r € (0.5, 1] and delay 7 > 0. The forget 
rate controls how quickly old information is forgotten and the delay down-weights early 
iterations to decrease the effect of the earlier noisy estimations. This step size obviously 
satisfies the conditions (23). Thus the iterative stochastic optimization of E step converges 
to a local optimum of the variational lower bound. In simulation, we fix the delay to be one 
and try various forget rates as different values of delay didn’t play a big role for our model. 
Although in theory the stochastic optimization of GVEM converges to a stationary point 
for any valid forget rate r, the quality and speed of the convergence may depend on r in 


practice. 
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5 Simulations 


5.1 Design 


A series of simulation studies were conducted to evaluate the performance of the proposed 
GVEM algorithm in comparison to the Metropolis-Hastings Robbins-Monro (MHRM) al- 
gorithm implemented in the R package, ‘mirt’ (Chalmers, 2012). The Metropolis-Hastings 
sampler is used to draw missing data (which is 8 in MIRT) in the stochastic imputation 
step of the MHRM algorithm (Cai, 2008, 2010a). In the ‘mirt’ package, MHcand is a vec- 
tor of values used to tune the MH sampler, with larger values yielding lower acceptance 
rate. By default, these values are determined internally and adjusted on-the-fly, attempt- 
ing to tune the acceptance of the draws to be between .1 and .4. In addition, the default 
number of Metropolis-Hastings draws at each iteration is 5, which is considered sufficient 
by Cai (2010a). Only the exploratory item factor analysis will be presented since it is a 
computationally more challenging scenario than the confirmatory analysis. That is, in the 
confirmatory analysis, many of the item loading parameters (or discrimination parameters) 
are constrained to 0 based on the pre-specified item factor loading structure. Hence, the 
update equation for @ (i.e., (9) for the 2PL model and (20) for the 3PL model) only needs 
minimum updates to reflect the constraints specified in the factor loading structure. In the 
exploratory analysis, we do not assume any constraint on the item discrimination parameter 
A while fix <g = J, during the estimation. A post-hoc rotation can then follow to rotate 
the factors and allow them to be correlated. The best-known rotation methods available 
in most commercial software packages are varimax (Kaiser, 1958) in orthogonal rotation or 


promax (Hendrickson & White, 1964) in oblique rotation. Other popular methods include, 
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for instance, the CF-Quartimax rotation (Browne, 2001). In the simulations studies, the 
promax rotation was used such that the factors were allowed to be correlated. Both the 
M2PL and M83PL were considered in the simulation studies. The number of dimensions was 
fixed at 3 and test length was fixed at 45. 

Additionally, we compared the performance of GVEM to the joint maximum likelihood 
(JML) estimator given that the JML estimator is also shown to be consistent under the same 
high-dimensional setting presented in Theorem 1 and efficient (Chen et al., 2019). The JML 
estimator was computed using the default settings in the R package ‘mirtjml’ implemented 
by Chen et al. (2019). Since Chen et al. (2019) did not study M3PL, here we only compared 
the performances for M2PL. 

The manipulated conditions include: (i) multidimensional structure, i.e. between-item 
multidimensionality and within-item multidimensionality; (ii) correlations among the latent 
traits, and (iii) sample size. In particular, for the between-item multidimensional structure, 
there were 15 items loaded onto each factor; whereas for the within-item multidimensional 
structure, about one third of the items were loaded onto one, two, and three factors respec- 
tively. In all cases, item discrimination parameters were simulated from Unif(1,2) distri- 
bution, and difficulty parameter b; was simulated from the standard normal distribution. 
For the M3PL model, the true guessing parameters were fixed at 0.2 for all test items. The 
latent traits 8; were generated from multivariate normal distribution, N(0, 9), where No is 
a covariance matrix whose diagonal elements were 1 and the off-diagonals were drawn from 
Uniform distribution. For the high correlation condition, the correlations were drawn from 
Unif(0.5,0.7) and for the low correlation condition, they were drawn from Unif (0.1, 0.3). 
Sample size was set at either 200 or 500. 
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The convergence criterion for the GVEM algorithm is ||/,||2 < 0.0001, where ||,||2 
refers to the Lz norm of all model parameters. The number of Markov chain samples drawn 
in the MHRM algorithm is by default 5000 in the R. package ‘mirt’. Lastly, the JML method 
adopts sequential change in log-likelihood as the convergence criterion and the tolerance of 
convergence is by default 5 in the R package ‘mirtjml’. 100 replications were conducted 
for each condition. Evaluation criteria include the average bias, root mean squared error 
(RMSE), and computation time of both methods. The parameter recovery for Ng is calcu- 
lated by taking differences between each entries of the true Ng and estimated Ng. Both bias 
and RMSE were obtained for each model parameter across all items within a condition first 


and then averaged over 100 replications. 


5.2 Results for the M2PL model 


Figures 1 and 2 compare the distributions of bias and RMSE of the model parameters from 
the two methods under the four manipulated conditions for the between-item and within- 
item M2PL model respectively. As shown, GVEM generally produces comparable or more 
accurate model parameter estimates than MHRM run by the R package ‘mirt’ in all con- 
ditions for both between-item and within-item models. With respect to the manipulated 
conditions, increasing sample sizes helps reduce the RMSE and bias of the parameter esti- 
mates in both GVEM and MHRM in ‘mirt’. Moreover, the RMSE and bias are generally 
higher when the correlations among factors are higher. This may be because higher correla- 
tion introduce multicollinearity among factors, making the parameter recovery more difficult 


(Wang & Nydick, 2015). Last but not least, the parameter recovery from the between-item 
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multidimensional model is better than the parameter recovery from the within-item multi- 
dimensional model. This is not surprising since the loading structure A is more complex in 
the within-item model. Figures 3 and 4 compare the distribution of bias and RMSE of the 
model parameters from GVEM and the JML method under the four manipulated conditions 
for the between-item and within-item M2PL models respectively. We observe that GVEM 
produces much lower RMSE and bias than the JML estimation under all conditions for both 
between-item and within-item models. The performance of the JML estimator is especially 
worse in small sample and high correlation settings and under more complex within-item 
multidimensionality structure. This could be due to the fact that the JML estimator as- 
sumes 0,’s as fixed effects whereas GVEM models them as random effects with multivariate 
Gaussian distributions which account for the factor correlations. This result suggests that 
our proposed estimation method not only is theoretically consistent but also performs bet- 
ter in practice particularly under these complex simulation settings with correlated latent 
factors. 

Figure 5 shows the average computation times in seconds for GVEM and MHRM in 
‘mirt’ over 100 replications. To demonstrate a thorough comparison of the computation time, 
additional simulation settings were added for Figure 5; three different sample sizes (N = 
200, 500, and 1000) and three different test dimensions (K = 3, 4, and 5) were considered as 
the simulation settings, resulting in 9 conditions in total. Each column presents the results 
for the between-item and withinin-item model respectively. Overall, GVEM algorithm is 
computationally more efficient than MHRM in both low and high correlation settings with 
varying sample sizes. The most reduction in computation time was observed in between- 
item model with low correlation setting. Unsurprisingly, computation time increases for both 
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methods when the number of dimensions increases or when sample sizes increase. 

We would like to emphasize that the above observations regarding the MHRM algorithm 
are based on the implementation of the algorithm in the ‘mirt’ package under the default 
setting. Researchers using other packages may get slightly different results. We also tried 
other tuning methods in flerMIRT and found that a more careful tuning can improve the 
performance of MHRM as in ‘mirt’; on the other hand, the estimation results can be very 
sensitive to the tuning, and the optimal tuning of MHRM could vary case by case, depending 
on the model setting and the correlation of the latent traits. For instance, following one 
reviewer’s kind suggestion, we found that the strategy of combining mirt’s default Stage 3 
setup with flex MIRT’s default Stages 1 and 2 setup provides slightly better estimation results 
than the proposed GVEM under the high correlation and between item model setting (while 
still slightly worse under the low correlation and within item model settings); please see 
Figure 1 in the Supplementary Material. Based on these observations, we clarify that the 
simulation study does not intend to conclude that the proposed GVEM outperforms the 
MHRM algorithm, but rather to show that GVEM provides a good alternative estimation 
method for MIRT, which does not rely on much tuning. Thoroughly evaluating the optimal 
tuning of MHRM algorithm is an interesting research problem, yet it is beyond the scope of 


the current paper, and we would like to leave that as a future study. 


5.3. Results for the M3PL model 


For the M3PL model, the sample size and forget rate for stochastically optimized 3PL 


algorithm were chosen based on pilot testing of various sample sizes and forget rates. We 
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observed that using the whole data set for the initial estimation step helped a lot with 
the estimation precision. Hence the forget rate was fixed at a small value so that the 
information from entire data set in the first iteration was weighted more heavily in the 
subsequest iterations (ie. forget the information from entire data set slowly with small 
forget rate). After the first iteration, only 5 data points were sampled at a time, resulting 
in a huge reduction in computation time. 

Figures 6 and 7 present the distributions of bias and RMSE of the model parameters 
from the two methods under the four manipulated conditions for the between-item and 
within-item M3PL model, respectively. During simulation studies, we observed that the 
performance of MHRM was quite unstable and the model did not converge well in M3PL 
under all manipulated conditions. Specifically, model did not converge in about 30 to 45% of 
the total experiments in most conditions. In another 15 to 20% of the experiments, the model 
converged but the estimates of the model parameters exploded to surprisingly high values, 
which implies the instability of the parameter estimation. For MHRM method, we excluded 
these results from the total of 100 experiments and reported only the values that seem more 
meaningful. On the other hand, we report the results for all 100 experiments for the GVEM 
method. Precisely, in Figure 6, 40 cases for (a), 41 for (b), 28 for (c), and 40 for (d) were 
reported. In Figure 7, 48 cases for (a), 46 for (b), 54 for (c), and 47 for (d) were reported. 
Note again that in both Figures, we report all 100 experiments for GVEM method because 
they all converged successfully. Similarly as in the simulation studies for M2PL, increasing 
sample sizes helps reduce the RMSE and bias of the parameter estimates in both GVEM and 
MHRM. However, the RMSE for MHRM method is quite high with large variation under 
most conditions. Overall, we observe that for varying sample sizes and correlations between 
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latent traits, GVEM performs better than MHRM, even after excluding unstable estimation 
results for MHRM. Given that the inclusion of guessing parameter poses model estimation 
challenge is well-documented in literature (e.g, Lord, 1968; Thissen & Wainer, 1982; Yen, 
1987), it is not too surprising to note the large proportion of non-converged replications 
from MHRM. However, the stable performance of GVEM further reinforces its promise as a 
robust alternative method to the current status-quo, in particular when guessing parameter 
is included in the model. Also note that GVEM does not need much tuning for good 
performance, hence it is more accessible to broader audience who may not have the technical 
capacity to manually tune certain parameters, as may required by other algorithms. One 
last note to make is, for M3PL or 3PL models in general, marginal maximum a posteriori 
estimation (MMAP) is sometimes preferred over the maximum likelihood approach. That 
is, prior distributions are specified for constrained estimation of the a and c parameters to 
improve estimation stability (Kim, 2006). Therefore, one can compare GVEM with MMAP 


in a future study as well. 


5.4 Estimating the Number of Dimensions 


In this section, a separate simulation study was conducted to evaluate if AJC* and BIC” 
could help identify the correct number of factors from data. The simulation design is the same 
as illustrated in Section 5.1. The result is presented for different sample sizes and degrees of 
correlation between latent traits. A total of 100 independent samples were generated for each 
setting, and the proportion of replications in which the correct number of factors identified 


by AIC* and BIC* were recorded. 
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Table 1 and Table 2 present the correct estimation rate of the number of dimensions 
for M2PL and M3PL models respectively. As shown, increasing sample size help increase 
the correct estimation rate. In addition, similar to the findings in the previous sections, 
lower correlation is more preferable as it usually produced higher correct estimation rates. 
There is only one exception, though, for the within-item M3PL model, in which both AJC* 
and BIC* performed better for the higher correlation scenario regardless of the sample size. 
There is no appreciable difference between AJC* and BIC™ except a few cells in Table 1: 
AIC™® performed better than BIC* for large Ng with sample size of 200, whereas BIC* 


performed better for small “gy with sample size of 200. 


6 Real Data Analysis 


In this section, the GVEM and MHRM algorithms were used to conduct an exploratory 
item factor analysis on the National Education Longitudinal Study of 1988 (NELS:88) data. 
In this data set, a nationally representative sample of approximately 24,500 students were 
tracked via multidimensional cognitive batteries from 8th to 12th grade (the first three stud- 
ies) in years 1988, 1990, and 1992. In this study, we focused on the science and mathematics 
test data where the multidimensional factorial structure has been previously investigated 
(e.g, Kupermintz & Snow, 1997; Nussbaum, Hamilton, & Snow, 1997). For the science sub- 
ject, there are 25 items and four factors emerged from the data collected in 1988: “Elemen- 
tary science (ES), “Chemistry knowledge (CK), “Scientific reasoning (SR) and “Reasoning 
with knowledge (RK). For the math subject, there are 40 items in 1988 and two factors 


emerged, they are “Mathematical reasoning (MR) and “Mathematical knowledge (MK). We 
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pooled together data from both domains, resulting in 65 items and a complete sample size 
of N=13,488. Because the factor structure was analyzed using normal theory factor analysis 
more than two decades ago, we plan to reanalyze the data using the proposed new meth- 
ods. In addition, pooling together both math and science domains result in potentially high 
dimensional data. First, both GVEM and MHRM were conducted assuming the number of 
factors were 6. The focus is on the recovery of the correlation matrix Ng and its comparison 
between two methods. Since the exploratory item factor analysis was conducted, in both 
GVEM and MHRM we assumed that Ue = Jk during GVEM estimation and later performed 
the same promax rotation to estimate the correlation matrix +. Second, GVEM was used 
to explore the dimension of latent traits from the data. 

Table 3 shows the estimated “g from both methods assuming the number of factors is 
6. The correlations in Ng from two algorithms look comparable although most values from 
GVEM are slightly smaller than those from MHRM. The negative correlations on the last 
row, especially, are similar between two correlation matrices. Please note that No is invariant 
to the ordering of the latent traits (i.e., the factor labels are arbitrary), hence it is possible 
to reduce the differences between two matrices by further reordering their columns in Table 
BP 

To further explore the optimal number of factors from data, we applied the GVEM 
algorithm with the information criteria for dimension selection. Figure 8 presented the 
results of latent dimension selection under M2PL and M3PL models. By fitting the M2PL 
model to the data, the optimal dimensionality of the latent traits was estimated to be six by 
both AIC* and BIC™ as shown in Figure 8. This corresponds to the number of latent traits 
identified in prior research. However, the dimensionality of the latent traits was estimated to 
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be five under the M3PL model. This result implies that some of the six latent traits may be 
highly correlated under the M3PL model that they are merged. Comparing the information 
criteria values across both M2PL and M83PL, it appears that AJC* and BIC* were smallest 
for the M2PL model with six factors. Hence, our results further validate the number of 
latent factors that could be extracted from the NELS:88 data. In addition, it suggests that 
the guessing didn’t play a significant role in students’ performance on the math and science 


cognitive test data. 


7 Discussions 


Variational methods are first introduced in psychometrics by Rijmen and Jeon (2013) for 
high dimensional IRT model with discrete latent traits, and later by Jeon et al (2017) ina 
form of a variational maximization-maximization algorithm for GLMMs with crossed random 
effects. Although their findings demonstrate great promise of variational methods as they 
apply in psychometrics, their methods are not ready for calibrating high-dimensional MIRT 
models with correlated latent factors and guessing parameters. In this paper, a new method 
based on variational approximation is proposed for the parameter estimation in the M2PL 
and M3PL models. Compared to the existing methods, it has the advantage of avoiding 
the calculation of intractable log-likelihood by approximating the lower bound to the log- 
likelihood. It also greatly reduces the computation complexity by deriving the closed-form 
updates in the every EM step. Moreover, the efficiency of the algorithm is further improved 
in the stochastic version. Simulation studies demonstrate that the proposed methods show 


better performance in terms of parameter recovery and computation time in both M2PL and 
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M3PL compared to the widely used MHRM method. Theoretical results are provided on 
the convergence rate, which shows that the estimation error goes to 0 as both the sample 
size and number of test items go to infinity. As byproducts of the GVEM algorithm, both 
AIC* and BIC* could be used to help identify the optimal number of latent factors from 
data, as reflected by the simulation results. 

Although the current simulation study and data analysis focused on the exploratory item 
factor analysis, the GVEM algorithm can also be easily applied to the confirmatory item 
factor analysis. In the latter case, the loading matrix A will have structural 0’s implying 
that certain items are irrelevant to certain factors. Similar to the approach in Cai (2010b), 
these user-defined restrictions can be incorporated in the estimation via linear constraints. 
Reflecting in the GVEM algorithm, due to the closed-form solutions in the M-step, handling 
the structural 0’s basically means multiplying A by a same size matrix of binary entries with 
1’s indicating the corresponding element is estimable. 

This work does not study standard errors of the GVEM estimation procedure. However, 
one can derive standard errors of the model parameters similarly following the existing works 
(Jamshidian & Jennrich, 2000). Relevant future research is needed on exploring the accuracy 
and efficiency of the estimation of standard errors in the GVEM framework. In addition, 
extending the GVEM framework to polytomous response models and four parameter IRT 


models (Meng, Xu, Zhang, & Tao, 2019) would be of another interest for the future research. 
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between-item | within-item 
Correlation(Xg) | N | AIC* | BIC* | AIC* | BIC* 
200 76 92 69 94 
small 500 82 91 76 83 
1000 | 88 93 79 85 
200 59 25 69 58 
large 500 66 Al 82 81 
1000 | 83 52 84 89 


Table 1: Simulation: correct estimation rate(%) in the M2PL model 


between-item | within-item 
Correlation(Xg) | N | AIC* | BIC* | AIC* | BIC* 
200 47 A7 63 63 
small 500 83 87 93 93 
1000 93 93 84 84 
200 40 43 83 83 
large 500 60 60 97 97 
1000 73 is 97 97 


Table 2: Simulation: correct estimation rate(%) in the M3PL model 


GVEM MHRM 
1 1 
622 1 049 1 
066.298 1 697  .432 1 
A72 112 ~~ .426 i 635 .532 .682 1 
A89 =—.869 A424 248 1 697 478 .740 .544 1 
= 16% —.388: —.701 =.512 +595. 1 —.607 —.497 —.602 —.525 —.592 1 


Table 3: Real Data: comparison of estimated Ng 
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Figure 1: Parameter recovery of the between-item M2PL models from exploratory factor analysis 
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Figure 2: Parameter recovery of the within-item M2PL models from exploratory factor analysis 
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Figure 3: Parameter recovery of the between-item M2PL models from exploratory factor analysis 
using GVEM and Joint Maximum Likelihood (JML) estimator 
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Figure 4: Parameter recovery of the within-item M2PL models from exploratory factor analysis 
using GVEM and Joint Maximum Likelihood (JML) estimator 
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Figure 5: Average computation time for (a) between-item model (first column) and (b) within- 
item model (second column) with low correlation (first row) and high correlation (second row). 


51 


a. —H 7 ae a 
S 1.0 a oe ee a < 1.0 ae a 
oc oc a 
0.5- =: mie = 1, ce 0.5 5 -l. oe: = r 
0 0- 1 1 1 1 1 0.0 - 1 1 1 1 i 
Oy Op Og b c Le Oy Og Og b c Le 
(a) N = 200, Low Correlations (6) N = 200, High Correlations 
2.0 - i 2.0 - 
wy P57 gy 15 H 
Qio- +r Qio- Fo o« Ff 
oO boo oP or 1 
0.5- aM Hi J = 0.5- =. att = 
0.0 - 1 1 1 1 1 1 0.0 - 1 1 1 1 1 a 
Oy Qh Og b c Le Oy Qh Og b c Xo 
(c) N = 500, Low Correlations (d) N = 500, High Correlations 
— ee —-—t + 
0.5- : =| 0.5- m | 
7p) -r: ap) ee 
B oo- _ 1 x : © 99-7 +4 ¥ 
SOC ee [ar Be OCR EL BT) AF Be 
-0.5- = -0.5- cd 4 
di; dp 03 b CY di; Gp 0g b C Ye 
(a) N = 200, Low Correlations (6) N = 200, High Correlations 
0.5- 0.5- 
i eee 
i?2) bso = (7p) fr bo Ein ane 
w - a = w - Se H 
S 0.0-Seapee Oren = Es hp -ij = 
-0.5- ey -0.5- - 7 
di; dp 03 b CY di; dp 03 b C Yo 
(c) N = 500, Low Correlations (d) N = 500, High Correlations 


Methods Ea GVeEM 388 MHRM 


Figure 6: Parameter recovery of the between-item M3PL models from exploratory factor analysis. 
For MHRM, (a) 40, (b) 41, (c) 28, (d) 40 cases of simulation results were reported due to convergence 
issue. For GVEM, all 100 cases were reported under all conditions. 
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Figure 7: Parameter recovery of the within-item M3PL models from exploratory factor analysis. 
(a) 48, (b) 46, (c) 54, (d) 47 cases of simulation results were reported due to convergence issue. For 
GVEM, all 100 cases were reported under all conditions. 
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Figure 8: Real Data: BIC* for both M2PL and M3PL (AIC* shows the same trend). 
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